Prediction models used in the progression of chronic kidney disease: A scoping review

Objective To provide a review of prediction models that have been used to measure clinical or pathological progression of chronic kidney disease (CKD). Design Scoping review. Data sources Medline, EMBASE, CINAHL and Scopus from the year 2011 to 17th February 2022. Study selection All English-language studies published in peer-reviewed journals in any country that developed at least one statistical or computational model predicting the risk of CKD progression. Data extraction Studies eligible for full-text review were assessed on the methods used to predict the progression of CKD. The information extracted included: the author(s), title of article, year of publication, study dates, study location, number of participants, study design, predicted outcomes, type of prediction model, prediction variables used, validation assessment, limitations and implications. Results From 516 studies, 33 were included for full-text review. The articles were compared qualitatively using the extracted information. The study populations were heterogeneous, and the data were sourced from different levels and locations of healthcare systems. 31 studies implemented supervised models, and 2 studies included unsupervised models. Regardless of the model used, the predicted outcome included measurement of the risk of progression towards end-stage kidney disease (ESKD), or related definitions, over given time intervals. However, the studies lacked consistency in reporting details of the development of their prediction models. Conclusions Researchers are working towards producing an effective model to provide key insights into the progression of CKD. This review found that Cox regression modelling was predominantly used among the small number of studies in the review. This made it difficult to perform a comparison between ML algorithms, more so when different validation methods were used in different cohort types. There needs to be increased investment in a more consistent and reproducible approach for future studies looking to develop risk prediction models for CKD progression.


Introduction
Chronic Kidney Disease (CKD) is a global health burden with an estimated 5 to 10 million annual deaths worldwide due to kidney disease [1,2]. Current data predict CKD will be the fifth leading cause of death worldwide by the year 2040 [3]. CKD is characterised by a gradual loss of the kidney's ability to remove wastes from the blood, and the severity of the disease is determined by the individual's estimated glomerular filtration rate (eGFR) [4]. CKD is arbitrarily categorised into five progressive stages, with stage five often referred to as end-stage kidney disease (ESKD), and its progression often leads to multiple overlapping complications [5,6]. There is a spectrum of pathological, hereditary, and sociodemographic factors known to contribute to a decline in kidney function [5][6][7][8][9][10][11]. These factors include age (≥60 years), smoking, low socioeconomic status, diabetes, hypertension, cardiovascular disease, body mass index (≥30 kg/m²), family history of kidney disease and use of pain-relieving medications [9][10][11].
The global nephrology community recognises that current models of care are insufficient to curb the growing CKD burden and that new care models are required to improve patient outcomes [12][13][14]. It has been suggested that the management framework for CKD needs to consider the disease across the entire life course of each individual [13]. New care models also need to consider improvements in areas such as disease surveillance, mitigation of risk factors, expanding research knowledge, and developing novel clinical interventions to slow the progression of CKD [13]. Despite having identified a number of risk factors associated with the onset of CKD, gaps remain in the methods for predicting the risk of CKD progression and interventions to slow CKD progression [13,15,16]. In addition, a large number of patients with CKD remain undetected through health systems [16] and clinicians have the challenge of managing the growing number of cases with limited tools for triaging patients.
prediction models that can improve our ability to identify individuals at risk, in addition to potentially improving our understanding of the natural history of disease progression and contribute to the clinical management of CKD [22,24,25]. The application of ML models provides capacity to tap into the information contained in large and complex datasets and exploit the complex non-linear dependencies [18,21,23,[26][27][28]. The application of these analytical techniques promises to improve our understanding of CKD progression and inform key interventions to help slow progression and reduce the burden of CKD [11,[29][30][31]. Moreover, it can help inform clinicians with regards to treatment options by increasing confidence in the patient's likely prognostic course [32,33].
Whilst the use of predictive modelling is gaining traction in CKD research, efforts are beset by the lack of a uniform approach to the reporting of important methodological advancements and developments of prediction models for CKD progression [23][24][25]34]. This lack of consistent reporting of key characteristics and the evaluation of model performance has likely impeded uptake and support of prediction models by clinicians, while undermining reproducibility of research and clinical utility [24]. An example of a standardised reporting guideline is the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, published through the EQUATOR Network, which consists of a checklist considered vital by healthcare professionals, methodologists and journal editors for the transparent reporting of multivariable prediction model studies [35,36]. By implementing such a checklist, reporting can be standardised and reproducibility improved while facilitating progress towards cross-validation between different health settings and populations globally.
Given the inconsistency in the advancements of predictive modelling used in CKD progression analysis, this paper provides timely evidence from a scoping review about prediction models used in the progression of CKD. The review aims to 1) identify and outline existing models used in predicting CKD progression; and 2) understand which outcome(s) were measured and which significant variables were selected when building a prediction model for CKD progression. Its results will help inform clinical and scientific developments in this area and provide a better understanding of CKD progression.

Classification of predictive models
Predictive modelling techniques can generally be classified into four broad categories: supervised, unsupervised, semi-supervised and reinforcement learning, with supervised and unsupervised being the most commonly applied in the medical field [18,22,25,33]. This was also reflected in this scoping review, where only supervised and unsupervised techniques were found in the studies assessed at full text; these are discussed in later sections.
Supervised techniques can be further divided by the type of outcome they predict, with the two major groupings including continuous outcomes and categorical outcomes [37]. The regression technique is utilised when output variables are continuous data, such as values for weight or height [37]. On the other hand, classification techniques are commonly used for simpler data such as nominal or categorical data, where a simple binary outcome or a few predetermined categorical responses are required [37]. Supervised techniques have their own challenges and require sufficiently large volumes of correctly labelled data initially to perform accurately [26]. Some examples of commonly used supervised machine learning algorithms are linear or logistic regression, artificial neural networks, decision trees, k-nearest neighbours (KNN), random forest for classification, gradient boosting and support vector machines (SVM) [37,38].
Unsupervised techniques can be further grouped into two types: clustering and association. Clustering is the process of segregating data into groups according to similar characteristics, whereas association is the process of identifying new relationships within datasets based on certain selected attributes of the data. Additionally, unsupervised algorithms do not need manual labelling of datasets, as they can group data into clusters or identify associations by themselves [26,38]. The end result of these methods is to provide a simplified interpretation of a complex dataset, and often to sort observations into groups [38]. These groups can then be inspected for their ability to predict the outcome of interest. Some common examples of unsupervised ML algorithms include K-means clustering, mixture models, distribution models, dimensionality reduction, independent component analysis and principal component analysis.
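To make the clustering idea concrete, the following is a minimal sketch of K-means on one-dimensional data. It is an illustration only and is not drawn from any of the reviewed studies: the function name `kmeans_1d`, the starting centroids and the eGFR values are all hypothetical, and real analyses would normally use several variables and an established library.

```python
from statistics import mean


def kmeans_1d(values, centroids, iterations=20):
    """Plain K-means on one-dimensional data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, clusters


# Hypothetical eGFR values (mL/min/1.73 m^2); note that no outcome labels are supplied.
egfr = [92, 88, 95, 85, 31, 28, 35, 25]
centroids, clusters = kmeans_1d(egfr, centroids=[90, 30])
```

The algorithm separates the values into a higher and a lower eGFR group without being told which patients progressed; whether those groups are predictive of an outcome has to be checked afterwards, as described above.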

Methods
A scoping review was selected as it allows identification and mapping of existing evidence, and investigation of the knowledge gaps surrounding the topic [39]. This method is suitable for examining emerging evidence across a broad field of study. The review was guided by the PRISMA extension for Scoping Reviews (PRISMA-ScR), following a standardised approach to search, screen, and report articles [40].

Data sources and searches
This scoping review was performed in the context of a larger study that investigates improving chronic kidney disease outcomes using linked routine records. With this context in mind, an initial concept grid was developed to address the objectives of the scoping review, together with the subsequent search histories that can be found in S1 Appendix. The review included studies from the past 10 years that developed or utilised any type of predictive modelling to predict the progression of CKD towards a more severe stage of the disease. Articles included were published in peer-reviewed journals from any country, in the English language, between 1st January 2011 and 17th February 2022 inclusive. Four electronic databases, Medline, EMBASE, CINAHL, and Scopus, were chosen for their bibliographic peer-reviewed publications that cover a broad range of medical life sciences, allied health, nursing and healthcare.

Study selection and search
Four main overarching concepts, as described in the concept grid, were selected for the development of the search strategy: kidney disease; disease progression; techniques; and outcomes. The initial search strategy was developed for use in Medline and subsequently adapted for the other databases; keywords and sub headers were amended to reflect search terms used in each respective database. The steps used in Medline are as follows:
1. (chronic kidney disease* or chronic renal disease* or CKD or kidney disease* or kidney failure).ti,ab.
13. 3 and 6 and 9 and 12
14. limit 13 to (english language and yr = "2011 -Current")
The first key concept for kidney disease included keywords and MeSH terms used in steps 1 and 2, to capture different types of chronic kidney diseases, such as diabetic nephropathies or similar diseases, since CKD is a chronic disease with multiple overlapping manifestations and associated comorbidities and risk factors [39]. The type of model used was not limited and included either statistical or ML algorithms used to predict CKD progression towards a wide range of clinical outcomes. A clear distinction was made that the study should examine prediction models for CKD progression, rather than models that predicted the onset of CKD.

Title and abstract screening
All articles were exported into EndNote and duplicate articles were removed. Two independent reviewers performed title and abstract screening by applying inclusion and exclusion criteria. To be included, a study had to implement a predictive model that was developed through analysis of health records and had to report an outcome on the progression of CKD. The list of exclusion criteria can be found in Table 1.
The authors recognised that CKD is a very broad topic and did not place restrictions on the type of predictive model that was developed, the population of interest, the source of health data records, the predictive variables that were used, or a specific outcome. Any disagreements about the exclusion of articles were resolved through discussion between the two reviewers and, if required, adjudication by a third reviewer.

PLOS ONE
Prediction modelling on CKD progression: A Scoping Review

Data extraction and quality assessment
The researchers wanted to better understand the significant considerations taken into account when developing a prediction model for CKD progression, and to explore how these studies measured CKD progression [35]. The information extracted followed the items listed in the TRIPOD statement, such as the article's title, author(s), publication year, year of study period, study locations and population size, study design (retrospective or prospective), predicted outcome(s), type of prediction model, predictors in the model, validation assessment, limitations, implications, eGFR formula and data balancing. Corresponding authors were contacted by email if the full text was not available, and articles were excluded if the full text remained unobtainable.

Results
The initial search had a combined total of 516 articles across Medline, EMBASE, CINAHL and Scopus, of which 188 duplicates were removed. 328 articles were then screened for their title and abstract, of which 245 articles were excluded based on exclusion criteria. 83 articles were then assessed for full-text eligibility by inclusion criteria, and subsequently 33 articles remained and were included in final qualitative review. Table 2 summarises the final articles that were included for full-text review.
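The screening counts above can be checked with simple arithmetic. The short sketch below uses only the figures reported in this paragraph; the number excluded at the full-text stage is not stated in the text and is derived here as the remainder, and the variable names are illustrative.

```python
identified = 516             # records found across the four databases
duplicates = 188             # removed before screening
excluded_on_screening = 245  # excluded at title and abstract screening
included = 33                # articles in the final qualitative review

screened = identified - duplicates                     # 328
full_text_assessed = screened - excluded_on_screening  # 83
# Not reported directly in the text; derived as the remainder.
excluded_at_full_text = full_text_assessed - included  # 50
```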
The risk of ESKD was generally predicted for specified time intervals of 1, 2, 3, and 5 years for supervised models, and shorter time intervals of 3, 6, 12 and 18 months for unsupervised models. Very few studies predicted outcomes such as progression from an earlier stage to a more severe stage of CKD, for example from stage 1 to stages 3 or 4 [54,56], or other endpoints such as a stated percentage decline in eGFR [9,41]. Depending on the quality of the available dataset [59,60], the predicted outcome could also be combined with other variables such as death, comorbidities, the type of dialysis and the time of diagnosis [46,59,68]. Examples of outcomes that integrated these additional variables include: the chance of future KRT at the time of CKD diagnosis [70]; a ≥50% decline in eGFR from baseline [50] or an eGFR decline of ≥30% from baseline [41]; the 5-year risk of KRT in CKD stages 3 and 4 [56]; and mortality and progression to ESKD over five years [65]. Fig 1 shows that 31 studies implemented supervised models and only 2 studies included unsupervised models, with 1 of these 2 being a comparison study between supervised and unsupervised models. Of the studies that used supervised models, 21 implemented Cox proportional hazards regression. Seven studies used machine learning (ML) methods [9,67-71], and one compared the performance of a number of ML techniques [70]. One study developed a model using Random Forest regression [68], and another implemented a disease2disease model by first learning the International Classification of Diseases and then clustering the data into groups by considering the variables within the dataset [69]. A multistate marginal structural model (MS-MSM) was also developed in one study that considers an estimated effect of time-dependent variables towards the predicted outcome [72].
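For readers less familiar with the technique used most often in the reviewed studies, the general form of a Cox proportional hazards model is sketched below. The notation is generic and is not taken from any individual study in this review.

```latex
h(t \mid \mathbf{x}) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p\right)
```

Here $h_0(t)$ is an unspecified baseline hazard, $x_1, \dots, x_p$ are the predictors (for example age, eGFR or ACR), and each $\exp(\beta_j)$ is the hazard ratio associated with a one-unit increase in $x_j$ while the other predictors are held constant.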

Significant variables in the model
Common predictors used in studies included age, sex, eGFR, urinary albumin to creatinine ratio (ACR), serum creatinine (SCr), diabetes, cardiovascular disease, body mass index (BMI), and high blood pressure. Each predictive model was unique and incorporated different combinations of variables, and slightly different definitions of variables such as high blood pressure. A paper by Xu et al. [61], published in 2021, highlighted that there are currently no robust biomarkers to predict progressive CKD, and that prediction instead relies on multiple longitudinal kidney measurements, such as eGFR and proteinuria. The eGFR formula was also not consistent across studies: 13 studies used the CKD Epidemiology Collaboration (CKD-EPI) equation [9,41,42,44,46,51,54-56,58,63,65,66] and 9 studies used the Modification of Diet in Renal Disease (MDRD) equation [43,45,47,49,53,60,62,64,71]. Two studies used unique equations customised for their specific cohort [48,69]. The remaining 9 studies did not specify the formula used to calculate eGFR.
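The choice of formula matters because the two named equations can return different estimates for the same patient. The sketch below illustrates this under stated assumptions: the reviewed studies do not all state which version they used, so it assumes the 2009 CKD-EPI creatinine equation and the IDMS-traceable four-variable MDRD equation, with serum creatinine in mg/dL. The function names and the example patient are hypothetical.

```python
def egfr_ckd_epi_2009(scr_mg_dl, age, female, black=False):
    """eGFR (mL/min/1.73 m^2) from the 2009 CKD-EPI creatinine equation (assumed version)."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    ratio = scr_mg_dl / kappa
    egfr = 141 * min(ratio, 1) ** alpha * max(ratio, 1) ** -1.209 * 0.993 ** age
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


def egfr_mdrd(scr_mg_dl, age, female, black=False):
    """eGFR (mL/min/1.73 m^2) from the IDMS-traceable four-variable MDRD equation (assumed version)."""
    egfr = 175 * scr_mg_dl ** -1.154 * age ** -0.203
    if female:
        egfr *= 0.742
    if black:
        egfr *= 1.212
    return egfr


# Hypothetical patient: 60-year-old male with serum creatinine of 1.0 mg/dL.
ckd_epi_value = egfr_ckd_epi_2009(1.0, 60, female=False)
mdrd_value = egfr_mdrd(1.0, 60, female=False)
difference = ckd_epi_value - mdrd_value
```

For this hypothetical patient the two equations differ by several mL/min/1.73 m², which is one reason an unreported formula makes results hard to compare across studies.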

Validation assessment
30 papers reported on the performance of their respective predictive models (regardless of the type of prediction model used) with 25 studies assessing the performance of their model by measuring the Area Under the Curve (AUC) [9, 41, 43, 45-54, 56-62, 64, 66-68, 70]. Both supervised and unsupervised techniques were shown to have used AUC to validate their prediction model, each having a relatively high value that indicated that their model was reliable in predicting their defined outcome. Relative performance of the prediction model was indicated using a variety of methods including sensitivity analysis, specificity, discrimination index and a goodness of fit analysis. However, only three studies were externally validated on an external population dataset [46,56,64].
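As a reminder of what the AUC measures, the sketch below computes it as the proportion of (progressor, non-progressor) pairs in which the progressor received the higher predicted risk, with ties counted as half. The function name `auc` and the example labels and scores are illustrative and are not taken from any reviewed study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = progressed) and predicted risks for four patients.
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking; the AUC says nothing about calibration, which is one reason the additional measures described here are also reported.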
Four studies explored the KFRE [41,49,64,65] as a variable to try to improve the performance of their prediction model. Only one study reported using the F-score with confidence intervals [67], and a range of alternative measures were used, including the mean square error, mean absolute error, normalised mean square error, positive predictive values, negative predictive values, Harrell bootstrap resampling method, D-statistic and various confusion matrices [56,68,71].
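Several of the measures listed above follow directly from a confusion matrix or from paired observed and predicted values. The sketch below shows the standard definitions; it assumes the F-score is the F1 variant (equal weight on precision and recall), which the source does not specify, and all counts and values are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures derived from the four cells of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)           # positive predictive value (precision)
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "f1": f1}


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical confusion matrix for 200 patients.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
# Hypothetical observed and predicted eGFR values.
mse = mean_squared_error([60, 45, 30], [58, 50, 27])
mae = mean_absolute_error([60, 45, 30], [58, 50, 27])
```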

Missing data & imbalanced data
The most common limitation reported was missing or limited data, potentially due to the quality and availability of the data collected. Studies tried to overcome this issue by filling in the missing data using imputation techniques and internal validation techniques to help justify the dataset [42,60]. There were also studies that reported having unbalanced data and outlined the methods applied to re-balance the data before initiating model development [9,42,45,62,70].
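One of the re-balancing approaches mentioned among the reviewed studies is to weight each class inversely to its frequency. The sketch below shows one common way of doing so; the normalisation n / (k × n_c) is an assumption chosen for illustration rather than the formula of any particular study, and the labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * n_c): n samples, k classes, n_c samples in class c."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical cohort in which only 10% of patients progressed.
labels = ["no_progression"] * 90 + ["progression"] * 10
weights = inverse_frequency_weights(labels)
```

With this weighting the rarer class contributes as much in total to model fitting as the common class, so a model cannot achieve a low training error simply by predicting that no patient progresses.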

Discussion
The arrival of big data and data science techniques has supported better analytics using data from a variety of sources. However, many healthcare systems around the world are yet to fully utilise healthcare data for research purposes. Many of the data challenges within health relate to missing data, inconsistencies in recorded data and privacy concerns for linking data across organisations [75]. Despite these challenges, the application of health data is critical to support clinical decision making [31,76,77].
The success of disease management for conditions like CKD is dependent upon a clinician's ability to identify the risk of disease progression and poor outcomes. By utilising big data analytics, healthcare professionals may be able to predict disease progression in a timely manner, allowing the potential for better treatment for patients and reduced health costs.
Our review identified studies that had developed models to predict patient outcomes for CKD that measured the risk of progression towards ESKD over given time intervals. There was no single gold standard model identified, with each study producing its own unique prediction model, dependent on the cohort's characteristics and the quality of the available data. While Cox regression modelling was the predominant method, there is burgeoning research on the use of ML techniques to improve the prediction of CKD progression towards ESKD [23]. However, the decision to use a particular modelling technique should depend on finding the most suitable model for the type, size and dimensionality of the data available [19].
The application of both traditional and ML techniques has been explored as a way of determining the most significant variables or features for inclusion in the model [56,70]. Studies that combined regression and ML techniques first identified significant variables through regression and then included them in the development of a risk prediction model [56,70,78]. However, the practicality of determining significant features can be highly dependent on the availability and the quality of data. It is clear that the performance of a model is degraded if there is a lack of significant variables or if it includes irrelevant features [78][79][80]. Therefore, it is also recommended that future studies attempt to obtain whole-population datasets, which can help reduce the risk of missing data and overcome the limitation of small study populations that are not generalisable to whole populations.
The study by Norouzi et al. demonstrated that an unsupervised adaptive neuro-fuzzy inference system (ANFIS), a type of neural network, was able to accurately predict GFR at sequential 6, 12 and 18-month intervals [71]. Other supervised non-ML models such as the KFRE and the ERBP algorithms, also produced results with high accuracy [63][64][65].
The comparison study by Dovgan et al. also demonstrated that features which correlated with a time approach produced the best results [70]. While the study did not include pathology results when developing their model, it produced the highest AUC via logistic regression, with XGBoost and Simple Gradient Descent a close second.
Stephens-Shields et al. [72] developed MS-MSMs that account for varying windows of time associated with different states while describing the effect that different exposures have on transitions between states or endpoints. This is particularly applicable to the slow progression of CKD in patients who enter the health system at different points in time and at various stages of the disease. In addition, each patient will have acquired different comorbidities and medical histories at different stages of their life.
Since the application of unsupervised and ML models is still in its exploratory stages, further research is required to investigate how these less explainable models handle very large and complex datasets that contain multi-dimensional and continuous variables [37,78], and how they can be applied to predict CKD progression.
The review revealed a lack of consistent reporting of the methodology used for development and validation of prediction models. This often led to under-reporting of model development, which hinders the ability of researchers to make a true comparison and to externally validate their predictive models against existing models. This was emphasised by the finding that almost one third of the studies reviewed did not report the eGFR formula used, which is a significant limitation for the development of this area of research. A standardised reporting statement has yet to be widely implemented in CKD progression research, which may be due to its relative novelty in the area of predictive modelling and statistical research [35].
Few studies explained how they attempted to re-balance their data, and methods differed for each study including log transformations, data resampling techniques, running simulation studies, and applying inversely proportional weights to class frequencies [9,42,62,70]. The predictive models that have been developed are often difficult to implement locally as they lack information that allows clinicians to validate them. Limitations on data linkage within and between health organisations also contribute to the challenge of implementing this research, where siloed datasets are unlikely to be representative of whole populations. It is also recommended that future studies should include clear reporting of model development including any balancing of skewed datasets, steps to validate the model, and a description of how significant variables were chosen, which should theoretically at least include age, sex, eGFR (using a formula that provides reliable estimates for the study population), details on the population's characteristics, ACR, BMI and time-related variables if available.
A reliable risk prediction model for CKD progression would not only provide clinicians with earlier identification of CKD patients at greatest risk of progression, it would also enhance consultations and help clinicians determine suitable treatment options to improve patient outcomes [81,82].

Conclusions
Nephrology researchers are working towards producing an effective model to assist the detection of the risk of chronic kidney disease progression. The review highlights that supervised techniques, and more specifically Cox regression, are the predominant models used to predict the progression of CKD. Only a small number of studies in the review used unsupervised and ML models, and these limited numbers made it very difficult to perform a comparison between these models. A more consistent and reproducible approach is required for future studies looking to develop risk prediction models for CKD progression. This would improve international collaborations and build upon the existing research to overcome the challenges to improving the effectiveness and reliability of these prediction models. Subsequently, this would also translate into enhanced health system planning, allocation of resources and improved health outcomes for CKD patients.
Supporting information S1 Checklist. Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist.